Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: preserve the order of the keys #211

Merged
merged 10 commits into from
Nov 26, 2023
Merged

fix: preserve the order of the keys #211

merged 10 commits into from
Nov 26, 2023

Conversation

zookzook
Copy link
Owner

No description provided.

@zookzook zookzook self-assigned this Sep 18, 2023
@rhcarvalho
Copy link
Contributor

@zookzook do you prefer to discuss here or back at #211?

I'll provide here some initial feedback related to the present implementation.

First off, thanks for taking the time to implement this extra functionality ❤️


Setup

Here's how I set this up on a Livebook:

Mix.install(
  [
    {:csv, "~> 3.0"},
    {:exqlite, "~> 0.13.15"},
    {:kino_db, "~> 0.2.2"},
    {:kino_explorer, "~> 0.1.10"},
    {:kino,
     github: "livebook-dev/kino", ref: "06625859ef367c50169a280ae5fe33cf9a49a8f3", override: true},
    {:mongodb_driver, github: "zookzook/elixir-mongodb-driver", ref: "preserve-order"},
    {:tzdata, "~> 1.1"}
  ],
  config: [
    elixir: [time_zone_database: Tzdata.TimeZoneDatabase],
    mongodb_driver: [decoder: BSON.PreserverOrderDecoder]
  ]
)

Query with explicit projection

And an example query:

Mongo.find(mongo, "Message", %{},
sort: %{_id: -1}, 
limit: 10, 
projection: [_id: 0, from: "$user.name", at: "$created_at", message: "$data"])
# |> Enum.to_list()
|> Enum.map(fn doc -> 
  doc
  |> Map.get(:order)
  |> Enum.map(fn key -> {String.to_atom(key), Map.get(doc, key)} end)
end)

Since I pass a projection using a keyword list, I expect and get back the :order list in the requested order, and all documents have the very same order. Worked great 👍

I know generating atoms dynamically has its perils, but in this case it seems under control and does improve readability of the output: [[foo: "bar"], ...] instead of [[{"foo", "bar"}], ...]

Query without a projection

Next query, without a projection:

Mongo.find(mongo, "Message", %{},
sort: %{_id: -1}, 
limit: 10)
# |> Enum.to_list()
|> Enum.map(fn doc -> 
  doc
  |> Map.get(:order)
  |> Enum.map(fn key -> {String.to_atom(key), Map.get(doc, key)} end)
end)

The query above represents better what I was originally looking for: exploring a collection and getting the fields in the order that MongoDB has them stored.

This query reviewed an obvious shortcoming of my code to transform the result: it should be applied recursively to handle subdocuments.

This got me thinking maybe I could write a custom decoder that handles recursion while parsing the response from Mongo, and such that I don't need to post-process query results. @zookzook what do you think?

@zookzook
Copy link
Owner Author

This got me thinking maybe I could write a custom decoder that handles recursion while parsing the response from Mongo, and such that I don't need to post-process query results. @zookzook what do you think?

Keep in mind, that the result of the decoder is used by the driver. If you would change the result by converting the binaries into atoms the whole driver will not work anymore.

Try to use the collection or write just a little recursive function that converts the data into the desired format.

@rhcarvalho
Copy link
Contributor

This got me thinking maybe I could write a custom decoder that handles recursion while parsing the response from Mongo, and such that I don't need to post-process query results. @zookzook what do you think?

Keep in mind, that the result of the decoder is used by the driver. If you would change the result by converting the binaries into atoms the whole driver will not work anymore.

I'm not sure I fully understand the implications. Could you give me a code example (perhaps a link to library source on GitHub) where this is happening?

Try to use the collection or write just a little recursive function that converts the data into the desired format.

I read https://hexdocs.pm/mongodb_driver/1.1.0/Mongo.Collection.html, and understood that to use Mongo.Collection effectively I'd need to declare all fields of my documents upfront, for every collection, defeating the idea of reading and printing the documents from Mongo the way (order of fields) they are.

I'll try the recursive function.

@zookzook
Copy link
Owner Author

I'm not sure I fully understand the implications. Could you give me a code example (perhaps a link to library source on GitHub) where this is happening?

Check this module:
https://github.com/zookzook/elixir-mongodb-driver/blob/master/lib/mongo/stream.ex

@rhcarvalho
Copy link
Contributor

I'm not sure I fully understand the implications. Could you give me a code example (perhaps a link to library source on GitHub) where this is happening?

Check this module: https://github.com/zookzook/elixir-mongodb-driver/blob/master/lib/mongo/stream.ex

Is it that exec_command_session returns a "document" that gets decoded the same way that a document within a cursor gets decoded?

From the Mongo.Stream perspective, I couldn't find where the cursor docs are expected to be maps 😕
Looks like docs is an "untouched" list of anything.

Then looking at

def exec_command_session(session, cmd, opts) do
with {:ok, conn, new_cmd} <- Session.bind_session(session, cmd),
{:ok, _cmd, response} <- DBConnection.execute(conn, %Query{action: {:command, new_cmd}}, [], opts),
:ok <- Session.update_session(session, response, opts),
{:ok, {_flags, doc}} <- check_for_error(response, cmd, opts) do
{:ok, doc}
else

I could guess that DBConnection.execute is the thing that eventually returns "doc", but couldn't find where BSON.Decoder is used in the flow going from a query to MongoDB.... I need to fill in some knowledge gaps :)

(...)

I did find https://github.com/zookzook/elixir-mongodb-driver/blob/preserve-order/lib/mongo/messages.ex, so the "messages from Mongo server" are treated the same as the documents within a query cursor, is that right?

So if the result of running a command would not be represented as a map, then the library would error out trying to access map keys or doing pattern matching.

@rhcarvalho
Copy link
Contributor

In the meantime, I wrote the recursive function:

defmodule Mongo.Document do
  def to_list(doc, order_key \\ :order) when is_map(doc) and not is_struct(doc) do
    doc
    |> Map.pop(order_key, [])
    |> then(fn {order, doc} ->
      doc
      |> Map.to_list()
      |> Enum.sort_by(fn {key, _} -> Enum.find_index(order, &(&1 == key)) end)
    end)
    |> Enum.map(fn
      {key, value} when is_map(value) and not is_struct(value) -> {key, to_list(value, order_key)}
      {key, value} -> {key, value}
    end)
  end
end

That could help me (or someone else in the future) go from the proposed map + :order key to an ordered collection:

Mongo.find(mongo, "Message", %{},
  sort: %{_id: -1},
  limit: 10
)
|> Enum.map(&Mongo.Document.to_list/1)

@zookzook
Copy link
Owner Author

From the Mongo.Stream perspective, I couldn't find where the cursor docs are expected to be maps 😕
Looks like docs is an "untouched" list of anything.

No, there are a lot of pattern matches:

 with {:ok, %{"cursorsAlive" => [], "cursorsNotFound" => [], "cursorsUnknown" => [], "ok" => ok}} when ok == 1 <- Mongo.exec_command_session(session, cmd, stream.opts) do
        :ok
      end

I did find https://github.com/zookzook/elixir-mongodb-driver/blob/preserve-order/lib/mongo/messages.ex, so the "messages from Mongo server" are treated the same as the documents within a query cursor, is that right?

Yes, the response is like a wrapper document that contains the "real" documents.

So if the result of running a command would not be represented as a map, then the library would error out trying to access map keys or doing pattern matching.

Yes, that's right!

@zookzook
Copy link
Owner Author

What about:

  def to_list(doc, order_key \\ :order) do
    doc
    |> Map.get(order_key, [])
    |> Enum.map(fn key -> {key, Map.get(doc, key)} end)
    |> Enum.map(fn {key, value} -> {key, eval(value, order_key)} end)
  end

  defp eval(%{__struct__: _other} = elem, _order_key) do
    elem
  end

  defp eval(doc, order_key) when is_map(doc) do
    to_list(doc, order_key)
  end

  defp eval(xs, order_key) when is_list(xs) do
    Enum.map(xs, fn elem -> eval(elem, order_key) end)
  end

  defp eval(elem, _order_key) do
    elem
  end

the find-index-function may result in a bottleneck.

Copy link
Contributor

@rhcarvalho rhcarvalho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Michael, thanks again for this PR ❤️

Thanks to you I understand a bit more of the underlying implementation of the library and can achieve what I wanted for exploring Mongo collections 💯

I wanted to leave a few suggestions for the documentation if this ends up in a future release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this as an experiment to compare benchmarks with old vs new decoder 😄

How is this file meant to be used? Should it be in the existing ./examples directory instead of a new one?

The doc is represented as a map, so encoding it doesn't preserve any well-defined order. Same for the documents within items -- IIUC the fields :id, :age and :name could end up serialized in any order and the order could even be different for different items.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. I need to remove it:

Original encoder 3        9.54 K      104.79 μs     ±6.53%      103.79 μs      130.19 μs
Original encoder 2        9.53 K      104.93 μs     ±6.48%      104.13 μs      129.92 μs
Original encoder          9.50 K      105.24 μs     ±7.51%      103.63 μs      138.92 μs
Original encoder 1        9.26 K      107.97 μs    ±19.33%      105.75 μs      150.14 μs

Original encoder 3       15.00 K       66.66 μs     ±7.15%       66.13 μs       76.91 μs
Original encoder 2       14.99 K       66.72 μs     ±8.32%       66.21 μs       76.29 μs
Original encoder 1       14.97 K       66.82 μs     ±7.40%       66.25 μs       76.78 μs
Original encoder         14.96 K       66.85 μs     ±7.80%       66.33 μs       76.67 μs

The new encode is faster than the old one.

lib/mongo/messages.ex Outdated Show resolved Hide resolved
lib/mongo/messages.ex Outdated Show resolved Hide resolved
lib/mongo/messages.ex Outdated Show resolved Hide resolved
lib/mongo/messages.ex Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated
end

defmodule BSON.PreserverOrderDecoder do
@moduledoc false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be willing to post the relevant docs to this module in Hexdocs.pm? I'd volunteer to help write it, since getting docs either on hexdocs or IDE/language server would help discover this feature and decide whether one needs it or not.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course.

README.md Outdated Show resolved Hide resolved
README.md Outdated
decoder: MyPreserverOrderDecoder

```
The decode module is defined at compiler time. The driver provides two types of decoder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The decode module is defined at compiler time. The driver provides two types of decoder:
The decoder module is defined at compile time. The driver provides two types of decoder:

@rhcarvalho
Copy link
Contributor

What about: (...) the find-index-function may result in a bottleneck.

Nice! I didn't consider performance. For my current needs it doesn't really matter. I'm running one-off queries on Livebook :)

Your version reads much better than mine.

Neither of us wrote tests, though 😊! I did manual-test with a few queries -- and while handling the ObjectId struct case I realized that depending on configuration, the :order key may or may not be present.

I think your to_list would return an empty list when the order_key is missing? One extra branching there to handle the case of a missing order_key might do it.

zookzook and others added 6 commits September 19, 2023 12:48
Co-authored-by: Rodolfo Carvalho <[email protected]>
Co-authored-by: Rodolfo Carvalho <[email protected]>
Co-authored-by: Rodolfo Carvalho <[email protected]>
Co-authored-by: Rodolfo Carvalho <[email protected]>
Co-authored-by: Rodolfo Carvalho <[email protected]>
Co-authored-by: Rodolfo Carvalho <[email protected]>
@zookzook
Copy link
Owner Author

Perhaps we should rename the default name :order to :__order__ to protect it from collisions.

@rhcarvalho
Copy link
Contributor

rhcarvalho commented Sep 22, 2023

Hey, I still have this in mind, just bandwidth has been limited :)

Mapping out our next steps here:

  • rename the default name :order to :__order__ to protect it from collisions
  • public documentation for decoders @rhcarvalho (update: the modules have only internal documentation, their existence if however documented in the README)
  • provide in documentation an example function mapping "maps with order field" to ordered lists (based on what we've already wrote in previous comments) @rhcarvalho
  • remove or document benchmarks (?)
  • anything else I missed? Feel free to edit.

By the way, today I learned about String.to_existing_atom through https://erlef.github.io/security-wg/web_app_security_best_practices_beam/common_web_application_vulnerabilities#atom-exhaustion, so that could be used in any example we may end up writing.

@rhcarvalho
Copy link
Contributor

I've solved the merge conflict in a local branch and I'm writing docs.

Wanted an opinion regarding configuring an alternative key instead of :__order__.

I thought exposing BSON.DecoderGenerator in docs might be premature/unnecessary, and we'd have more flexibility changing or removing it in future versions if it was kept as an implementation detail for now. I think less concepts also makes the driver easier to use.

What do you think of

  1. Use default :__order__
config :mongodb_driver,
  decoder: BSON.OrderPreservingDecoder
  1. Set custom key
config :mongodb_driver,
  decoder: {BSON.OrderPreservingDecoder, key: :original_order}

I also briefly considered not introducing BSON.OrderPreservingDecoder at all, but just some driver-level config or config to the default decoder, but didn't quite like it after all... choosing good names is always hard :)

@rhcarvalho
Copy link
Contributor

Ah, worth contrasting the above suggestion with what we have currently:

If you want to change the :order key then define a new decoder module:

defmodule MyPreserverOrderDecoder do
  @moduledoc false

  use BSON.DecoderGenerator, preserve_order: :the_key_order
end

(and on top of creating a module, one would still need to update the config too!)

@rhcarvalho
Copy link
Contributor

(I was reading a blog post looking for inspiration...)

A disadvantage of setting the decoder globally is that we cannot choose to open two connections with different behaviors, they are forced to use a single decoder implementation.

In tests we can still test each decoder module independently, that's okay -- I wrote some basic tests for each of the implementations we have, with and without order.

@zookzook what if we make the decoder configurable per connection? Apparently it would require wiring things together in Mongo, Mongo.Messages and Mongo.MongoDBConnection.

@rhcarvalho rhcarvalho mentioned this pull request Oct 19, 2023
* fix: handle read preferences without tags

* Fixed minor bug parsing url connection. Updated dependencies libraries. (#209)

* Fixed minor bug parsing url connection. Updated dependencies libraries.

* Incremented max_nesting option in credo refactor nestring

* Changed function ReadPreference.primary() by ReadPreference.merge_defaults() in topology test file

---------

Co-authored-by: Juan Antonio Jiménez <[email protected]>
Co-authored-by: Michael Maier <[email protected]>

* fix: applies the global timeout value to each query (#215)

* Document BSON.Decoder and BSON.PreserveOrderDecoder

But keep the documentation for those modules private for now.

Expand the documentation on Data Representation and how to deal with
preserving key order when encoding and decoding data.

* Test BSON decoders

* Document how to go from annotated maps with order keys to lists

---------

Co-authored-by: zookzook <[email protected]>
Co-authored-by: Juan Antonio <[email protected]>
Co-authored-by: Juan Antonio Jiménez <[email protected]>
@zookzook zookzook merged commit b360d87 into master Nov 26, 2023
3 checks passed
@zookzook zookzook deleted the preserve-order branch November 26, 2023 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants